Statistical Mechanics of the Chinese Restaurant Process: 
lack of self-averaging, anomalous finite-size effects and condensation 
The Pitman-Yor, or Chinese Restaurant Process, is a stochastic process that generates distributions following a power law with exponents lower than two, as found in numerous physical, biological, technological and social systems. We discuss its rich behavior with the tools and viewpoint of statistical mechanics. We show that this process invariably gives rise to a condensation, i.e. a distribution dominated by a finite number of classes. We also evaluate thoroughly the finite-size effects, finding that the lack of a stationary state and of self-averaging of the process creates realization-dependent cutoffs and behavior of the distributions with no equivalent in other statistical mechanical models.
Despite their extreme behavior, power-law tailed probability distributions empirically describe the partitioning (i.e. the organization into sub-classes) of a large variety of physical, biological, technological and social systems [1], and the degree distributions of many complex networks [?]. The most extreme cases are the systems where the power laws $p_k$ have exponent $\gamma$ between one and two. Indeed, if $\gamma \in [1,2]$ both the first and the second moment of the distribution diverge, meaning that the system is so biased that neither averages nor fluctuations are well-behaved. This seems to happen, for example, for words in a text or the population of cities (Zipf's law [?]), the size of classes of homologous proteins, the out-degree of transcription networks [?], the frequency of family names [1], and the degree distribution of different social/technological networks [?].

While the processes leading to power-law distributions are diverse [1], there are only a few available models that capture this behavior. In particular, it is useful to formulate models that help the understanding of these distributions in the framework of non-equilibrium growth laws. The paradigm is the Yule/Simon process [?], which describes an evolving system of a growing number of elements, where the number of classes grows linearly with the number of elements. It is based on the general mechanism of the "Matthew effect", or "cumulative advantage": with time, more populated classes acquire new elements at a higher relative rate. The Yule/Simon process generates power-law distributions with exponents $\gamma \in (2,3]$ [1]. Barabási and Albert [10] have shown that a similar mechanism can be used to generate power-law networks with the same exponents that grow and evolve through preferential attachment [?].

While in some systems where exponents $\gamma < 2$ are observed it can be argued that preferential attachment is present, this case is not predicted by the Yule/Simon/Barabási-Albert (YSBA) model. These distributions are more biased towards highly populated classes, so that in order to obtain this behavior one has to reweigh the balance of growth and preferential attachment in favor of the latter, and in particular consider processes where the number of classes grows sub-linearly with the number of elements.

The process we consider here, called the Pitman-Yor or Chinese Restaurant Process (CRP) [11, 12], has exactly this property, which makes it reproduce power laws with $\gamma \in (1,2]$. It is commonly used in the mathematics literature, but relatively disregarded by the physics community, and in particular unexplored using the tools of statistical mechanics. It is defined as a discrete-time stochastic process generating a partition of a number of elements into classes, such that each element belongs to a given class. In probability, the CRP is used, for example, as a prior in nonparametric Bayesian methods, and it has been applied to multiple problems ranging from modeling texts to genetics and functional genomics [13, 14, 15, 16, 17]. It also maps to a non-self-averaging stick-breaking process [?]. Recently, we observed that CRP-like processes model well the evolution of protein domain families [?] and reproduce the scaling laws found by genomics methods, which adds a strong motivation to explore them.

The mathematical characterization of the CRP has been carried out [12] with special attention to the asymptotics of the process at large times $T \to \infty$. In this Letter, we characterize this process with the tools of statistical mechanics. First, we argue that the CRP always exhibits a condensation phenomenon [?], with few classes dominating the total population. We relate the condensation observed in the CRP to other known mechanisms taking place in the well-studied phenomenology of the Zero-Range Process [12], or in network models [21]. Second, we present a calculation that shows how the process behaves for large but finite times $T$, finding anomalous finite-size corrections to the asymptotic formulas. Unlike the YSBA model [27], the lack of self-averaging of the CRP determines, for certain parameter values, a nontrivial and realization-dependent finite-size behavior, which is our main finding. Thus, the CRP fills two important gaps in the fundamental statistical mechanical understanding of non-self-averaging phenomena, power-law distributions, and condensed states.

General Considerations. At each time $T$, the CRP generates a partition of the integers $\{1, 2, \ldots, T\}$ into different classes. Differently from other models [?, 22], the number of classes $N(T)$ is a stochastic variable depending on the realization.

The process is anecdotally described as a problem of customers entering a Chinese restaurant with table sharing, where the number of tables and the number of guests per table are unbounded. Assume that in the restaurant there are $T$ customers sitting at $N(T)$ tables, with $k_i$ customers at each table $i = 1, 2, \ldots, N(T)$. At time $T$ a new customer enters the restaurant and either sits at table $i = 1, \ldots, N(T)$ with probability $p_i$ (in this case $N(T+1) = N(T)$), or chooses a new table with probability $p_{N(T)+1}$ (in this case $N(T+1) = N(T) + 1$). The probabilities $p_i$ and $p_{N(T)+1}$ in the CRP are given by

$$ p_i = \frac{k_i - \alpha}{T + \theta}, \qquad p_{N(T)+1} = \frac{\alpha N(T) + \theta}{T + \theta}, \tag{1} $$

where $\theta > 0$ and $\alpha \in [0, 1)$.
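The seating rule of Eq. (1) can be sketched as a short simulation. This is a minimal illustration and not the code used for the results of this Letter: the function name `simulate_crp` and its arguments are hypothetical, and the linear scan over tables is chosen for readability rather than speed.

```python
import random

def simulate_crp(T, alpha, theta, seed=None):
    """Seat T customers one by one; return the occupations k_i of the tables.

    Hypothetical helper. Assumes theta > 0 and 0 <= alpha < 1.
    """
    rng = random.Random(seed)
    tables = []                      # tables[i] = k_i
    for t in range(T):               # t customers are already seated
        # All weights below sum to t + theta.
        u = rng.random() * (t + theta)
        new_weight = alpha * len(tables) + theta
        if u < new_weight:           # new table: (alpha*N + theta)/(t + theta)
            tables.append(1)
            continue
        u -= new_weight
        for i, k in enumerate(tables):
            u -= k - alpha           # table i: (k_i - alpha)/(t + theta)
            if u < 0:
                tables[i] += 1
                break
        else:
            tables[-1] += 1          # guard against floating-point round-off
    return tables

if __name__ == "__main__":
    occupations = simulate_crp(2000, alpha=0.5, theta=1.0, seed=1)
    print(len(occupations), max(occupations))
```

The first customer always opens a table, because with no tables present the new-table weight equals the total weight $\theta$.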

As in the YSBA model, the CRP includes growth of elements $T$ and classes $N$, and a preferential attachment principle, because more populated tables are more likely to acquire new customers. However, in the CRP, growth of classes is not constrained to preferential attachment of new class members, but these two processes are decoupled, as witnessed by the fact that the probability to add a new table is not constant. This probability decays with the number of guests $T$, increasing the weight of "hub classes", which is the essential ingredient to reproduce power-law distributions with exponent lower than two. While models producing power laws with $\gamma \in (1,2]$ exist [?], they usually lack the flexibility of the CRP in modulating this weight.

Rephrasing the YSBA model in terms of customers in a restaurant, the probability $p_i^{Y}$ to sit at a non-empty table $i$ at time $T$ and the probability $p_{N(T)+1}^{Y}$ to sit at a new table at time $T$ are $p_i^{Y} = (1 - \epsilon) k_i / T$ and $p_{N(T)+1}^{Y} = \epsilon$, where $k_i$ is the occupation number of table $i = 1, \ldots, N(T)$. Note that this means that new tables are added with statistically independent moves, while in the CRP the addition of a new table is statistically dependent on the configuration of the partition [3].
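For contrast, the YSBA rule in the same restaurant language can be sketched as follows. The function name `simulate_ysba` is hypothetical, and seeding the process with a single occupied table is an assumption of this sketch, needed because the rule is undefined for an empty restaurant.

```python
import random

def simulate_ysba(T, eps, seed=None):
    """Seat T customers with a constant new-table probability eps.

    Hypothetical helper. Returns the table occupations k_i.
    """
    rng = random.Random(seed)
    tables = [1]                 # assumption: the first customer opens table 0
    owner = [0]                  # owner[j] = table of customer j
    for t in range(1, T):        # t customers are already seated
        if rng.random() < eps:   # new table, independent of the configuration
            tables.append(1)
            owner.append(len(tables) - 1)
        else:
            # Picking a uniformly random seated customer selects table i
            # with probability k_i / t, so overall (1 - eps) * k_i / t.
            i = owner[rng.randrange(t)]
            tables[i] += 1
            owner.append(i)
    return tables

if __name__ == "__main__":
    occupations = simulate_ysba(20000, eps=0.1, seed=1)
    print(len(occupations))      # grows linearly, close to eps * T
```

Because each new-table move is an independent coin flip, the number of tables is close to $\epsilon T$, in contrast with the sub-linear growth of the CRP.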

The CRP has been studied extensively in the mathematical literature [12]. The occupation distribution in the limit $T \to \infty$, with $k$ finite and fixed, is

$$ F_\infty(k) = \alpha\, \frac{\Gamma(k-\alpha)}{\Gamma(k+1)\Gamma(1-\alpha)}. $$

Furthermore, the statistics of the number of tables $N(T)$ has been characterized. The average value $\langle N(T) \rangle$ of the number of tables at time $T$ is given by [?]

$$ \langle N(T) \rangle \simeq \begin{cases} \dfrac{\Gamma(\theta+1)}{\alpha\, \Gamma(\theta+\alpha)}\, T^{\alpha} & \text{for } \alpha > 0, \\ \theta \log(T) & \text{for } \alpha = 0. \end{cases} $$

In the limit of large $T$, the full probability distribution $\mathcal{P}(N(T))$ of $N(T)$ is known [?], when $\alpha = 0$, to be a Gaussian of mean $m$ and variance $\sigma^2$ with $m = \sigma^2 = \theta \log(T)$. In the case $\alpha > 0$, instead, the variable $s = N(T)/T^{\alpha}$ asymptotically follows the Mittag-Leffler distribution $Q_{\alpha,\theta}(s)$ [12]. This point is particularly interesting [12, 29] because, in the asymptotic limit, the Mittag-Leffler distribution has finite fluctuations, implying that the number of tables in the Pitman-Yor process with $\alpha > 0$ is a non-self-averaging quantity.

The CRP is always in a condensed state. We now provide an argument comparing the phenomenology of the CRP to models exhibiting condensation phenomena. Extending the validity of the asymptotic formula $F_\infty(k)$ to all values of $k$, we can estimate the occupation of the maximally occupied table in the CRP. We observe that this table always has occupation $k_{\max} = O(T)$. In fact, since $F_\infty(k) \sim k^{-\alpha-1}$, we can evaluate the occupation $k_{\max}$ of the maximally occupied table by imposing the defining condition that the fraction of tables with $k > k_{\max}$ must be of the order of $1/N$, i.e.

$$ \sum_{k > k_{\max}} F_\infty(k) \simeq \frac{1}{N}. \tag{2} $$

Since in the Chinese Restaurant Process $N = O(T^{\alpha})$ if $\alpha > 0$, and $N = O(\log(T))$ if $\alpha = 0$, in both cases this estimate indicates that the maximally occupied table holds a finite fraction of the customers, $k_{\max} = O(T)$.
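The estimate of $k_{\max}$ for $\alpha > 0$ can be checked numerically. The sketch below assumes the normalized form $F_\infty(k) = \alpha\,\Gamma(k-\alpha)/[\Gamma(k+1)\Gamma(1-\alpha)]$, for which the tail sum telescopes to $\sum_{k>K} F_\infty(k) = \Gamma(K+1-\alpha)/[\Gamma(K+1)\Gamma(1-\alpha)] \simeq K^{-\alpha}/\Gamma(1-\alpha)$. The function names are hypothetical.

```python
from math import exp, gamma, lgamma

def f_inf(k, alpha):
    """Asymptotic occupation distribution, assumed normalized over k >= 1."""
    return alpha * exp(lgamma(k - alpha) - lgamma(k + 1) - lgamma(1 - alpha))

def tail(K, alpha):
    """Closed form of sum_{k > K} f_inf(k), obtained by telescoping."""
    return exp(lgamma(K + 1 - alpha) - lgamma(K + 1) - lgamma(1 - alpha))

def k_max_estimate(N, alpha):
    """Solve tail(K) = 1/N with the large-K form K**(-alpha)/Gamma(1-alpha)."""
    return (N / gamma(1 - alpha)) ** (1 / alpha)

if __name__ == "__main__":
    alpha, s = 0.5, 2.0
    for T in (1e3, 1e4, 1e5):
        N = s * T ** alpha            # number of tables scaling as T**alpha
        print(T, k_max_estimate(N, alpha) / T)
```

With $N = sT^{\alpha}$ the ratio $k_{\max}/T$ printed by the loop is independent of $T$. Only this linear scaling is meaningful: the prefactor of such an order-of-magnitude argument is not, and it can exceed one for some parameter values.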

When the maximal occupation of a class is of the same order as the total number of elements in the partition, one says that the distribution is in a "condensed" phase. Reference models studied in the statistical physics community are the Zero-Range Process (ZRP) and the Bose-Einstein condensation of networks (BECN) [21]. In the BECN the condensation occurs on a single special node, for power-law degree distributions with exponent $\gamma = 2$, as a consequence of the heterogeneity of the classes. In the ZRP, particles hop on 1-D lattice sites according to prescribed laws [22], generating partitions of elements into classes, i.e. clusters of particles, with power-law behavior and exponent $\gamma$. Depending on the dynamics and particle density, a condensation can occur in the ZRP, where one class becomes occupied by a finite fraction of elements.

It is instructive to illustrate the main differences between the condensation phenomena occurring in the ZRP and in the CRP. First, in the ZRP the exponent $\gamma$ of the distribution can be larger or smaller than 2, but the condensation occurs only if $\gamma > 2$, while in the CRP a condensation always occurs and the distribution of the partition decays with an exponent $\gamma = 1 + \alpha < 2$. Second, in the $\gamma > 2$ ZRP, the condensation transition is driven by the density of particles $\rho$: if $\rho > \rho^*$ there is a condensation, if $\rho < \rho^*$ there is no condensation, and the condensate appears in order to balance between the imposed finite value of $\rho$ and the natural average value of the power law. Conversely, in the CRP the mean density of elements always diverges, which, in the large $T$ limit, imposes the existence of classes with a finite fraction of the total number of elements. Thus, a relevant difference between the ZRP (and BECN) and the CRP is that in the CRP there is a degenerate distribution but no phase transition. This situation closely resembles the so-called "pseudo-condensation" found in [30], where the condensation is characterized in a ZRP with a non-extensive number of classes. However, while in that case the scaling of the number of classes with the number of elements is chosen ad hoc, in the CRP this scaling is a natural outcome of the process.

Conditioned Path Integral of the CRP and anomalous finite-size effects. At finite sizes the $\alpha > 0$ CRP shows an intriguing phenomenology, where the trend of individual realizations determines their distribution. This is visible from the finite-size scaling of the distribution $F(k,T)$. We thus study $F(k,N,T)$, with the additional condition of a fixed number of tables $N$.

The probability of a partition of $T$ elements is the probability distribution at time $T-1$ times the probability of an event at time $T$. Therefore the probability $P(\{k_i\})$ of a process from time $T = 1$ to time $T$, giving rise to an occupation of $N$ tables $i = 1, 2, \ldots, N$, each one occupied by $k_i$ individuals, is given by the product of the probabilities (1) for each subsequent event. In particular, this probability can be written as

$$ P(\{k_i\}) = C_{N,T} \prod_{i=1}^{N} \frac{\Gamma(k_i-\alpha)}{\Gamma(1-\alpha)}\; \delta_{\sum_{i} k_i,\, T}, \tag{3} $$

where $\delta$ is the Kronecker delta fixing the total number of customers and the constant $C_{N,T}$ is given by $C_{N,T} = \alpha^{N}\, \Gamma(N + \theta/\alpha)\, \Gamma(\theta) / [\Gamma(\theta/\alpha)\, \Gamma(T + \theta)]$. Most notably, the probability $P(\{k_i\})$ of a process giving rise to the occupation numbers $\{k_i\}$, Eq. (3), is independent of the history of the process. In this case $P(\{k_i\})$ is called a distribution of exchangeable random variables [12]. Moreover, since $P(\{k_i\})$ takes a factorizable form, this probability distribution is also referred to as a Gibbs measure [12].
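The history independence stated above can be verified directly for small examples. The sketch below multiplies the step-by-step seating probabilities along two different histories that end in the same occupations, and compares the result with the closed form $C_{N,T} \prod_i \Gamma(k_i-\alpha)/\Gamma(1-\alpha)$. Function names and the encoding of a history as a list of table indices are assumptions of this illustration; it requires $\alpha > 0$ and $\theta > 0$.

```python
from math import lgamma, log

def log_p_closed(ks, alpha, theta):
    """log of C_{N,T} * prod_i Gamma(k_i - alpha) / Gamma(1 - alpha)."""
    N, T = len(ks), sum(ks)
    log_c = (N * log(alpha) + lgamma(N + theta / alpha) + lgamma(theta)
             - lgamma(theta / alpha) - lgamma(T + theta))
    return log_c + sum(lgamma(k - alpha) - lgamma(1 - alpha) for k in ks)

def log_p_history(history, alpha, theta):
    """Sum the log seating probabilities along one history.

    history[t] is the table chosen by customer t; an index equal to the
    current number of tables means that a new table is opened.
    """
    tables, logp = [], 0.0
    for t, i in enumerate(history):
        if i == len(tables):
            logp += log((alpha * len(tables) + theta) / (t + theta))
            tables.append(1)
        else:
            logp += log((tables[i] - alpha) / (t + theta))
            tables[i] += 1
    return logp, tables

if __name__ == "__main__":
    a, th = 0.3, 1.5
    for h in ([0, 0, 1, 0, 1, 2], [0, 1, 2, 0, 0, 1]):
        lp, ks = log_p_history(h, a, th)
        print(ks, lp, log_p_closed(ks, a, th))
```

Both histories produce the occupations $\{3, 2, 1\}$ and the same probability, which is the exchangeability property in a concrete case.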

We can construct a conditioned path integral of this process by summing over all the histories keeping $T$ and $N$ constant. Since the events in the CRP are exchangeable, i.e. the probability is invariant under any permutation of the set of class indexes, we can sum over the histories in which the partition $\{k_i\}$ is generated in random order. To account for the number of these histories we introduce the multinomial prefactor $T!/(\prod_i k_i!\, N!)$. This leads to the following expression for the partition function $Z_{N,T}$,

$$ Z_{N,T} = \sum_{\{k_i\}_{i=1,\ldots,N}} \frac{T!}{\prod_{i} k_i!\; N!}\, P(\{k_i\}). \tag{4} $$

Similarly, the probability $F(k,N,T)$ that, in a process studied at time $T$ when $N$ tables are occupied, a random table is occupied by $k$ guests reads

$$ F(k,N,T) = \frac{1}{Z_{N,T}} \sum_{\{k_i\}_{i=1,\ldots,N}} \frac{T!}{\prod_{i} k_i!\; N!}\, \delta_{k_1,k}\, P(\{k_i\}). \tag{5} $$

Or equivalently,

$$ F(k,N,T) = \frac{Z_{N,T-k}}{Z_{N,T}}\, \frac{\Gamma(k-\alpha)}{\Gamma(k+1)\Gamma(1-\alpha)}. \tag{6} $$
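For small $N$ and $T$ the conditional distribution of Eq. (5) can be evaluated exactly, without any saddle-point approximation. The sketch below uses the single-table weight $w(k) = \Gamma(k-\alpha)/[\Gamma(1-\alpha)\,\Gamma(k+1)]$ and the recursion $W_n(t) = \sum_k w(k)\, W_{n-1}(t-k)$; the factors $T!$, $N!$ and $C_{N,T}$ are common to numerator and denominator and drop out. In this exact form the remaining $T-k$ guests are distributed over $N-1$ tables, so that $F(k) = w(k)\, W_{N-1}(T-k)/W_N(T)$. The function name is hypothetical.

```python
from math import exp, lgamma

def conditional_occupation(N, T, alpha):
    """Return F with F[k] = probability that a given table has k guests,
    conditioned on N occupied tables and T guests (F[0] is unused)."""
    w = [0.0] + [exp(lgamma(k - alpha) - lgamma(1 - alpha) - lgamma(k + 1))
                 for k in range(1, T + 1)]
    W = [1.0] + [0.0] * T            # W_0(t): no tables, only t = 0 allowed
    prev = W
    for n in range(1, N + 1):
        prev = W                     # prev = W_{n-1}
        W = [0.0] * (T + 1)
        for t in range(n, T + 1):    # n tables need at least n guests
            W[t] = sum(w[k] * prev[t - k] for k in range(1, t + 1))
    return [0.0] + [w[k] * prev[T - k] / W[T] for k in range(1, T + 1)]

if __name__ == "__main__":
    F = conditional_occupation(N=5, T=60, alpha=0.3)
    print(sum(F), sum(k * F[k] for k in range(1, 61)))
```

The distribution is normalized and its mean is $T/N$, as it must be by symmetry between tables. The cost grows as $N T^2$, so this is only practical as a cross-check at moderate sizes.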



Roughly, the ratio of Gamma functions appearing on the r.h.s. of this equation is related to the power-law behavior, while the ratio of partition functions gives the finite-size corrections. The function $Z_{N,T}$ can be evaluated, for large $T$, with a saddle-point approximation of the integral

$$ Z_{N,T} = \frac{T!\, C_{N,T}}{N!} \int \frac{d\omega}{2\pi} \left[ \sum_{k=1}^{T} \frac{\Gamma(k-\alpha)}{\Gamma(1-\alpha)\Gamma(k+1)}\, e^{-i\omega k} \right]^{N} e^{i\omega T}, \tag{7} $$

where the integration over $\omega$ comes from the Fourier representation of the Kronecker delta of Eq. (3), and the saddle point $i\omega^* = \bar\omega := x/T$ satisfies the equation

$$ \frac{T}{N} = \frac{H_1(T,\alpha,x/T)}{H_0(T,\alpha,x/T)}, \tag{8} $$

where

$$ H_n(T,\alpha,\omega) = \sum_{m=1}^{T} m^{n}\, \frac{\Gamma(m-\alpha)}{\Gamma(1-\alpha)\Gamma(m+1)}\, e^{-\omega m}. \tag{9} $$
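The saddle-point condition $T/N = H_1/H_0$ can be solved numerically by bisection, since the ratio $H_1/H_0$ is the mean of $m$ under the weights $\Gamma(m-\alpha)\, e^{-\omega m}/\Gamma(m+1)$ and therefore decreases monotonically with $\omega$. The sketch below is an illustration under these assumptions; the function names, the bracket $[-5, 5]$ and the number of iterations are arbitrary choices. The weights are shifted by their maximum before exponentiation to avoid overflow for negative $\omega$, and the constant $\Gamma(1-\alpha)$ is omitted because it cancels in the ratio.

```python
from math import exp, lgamma

def mean_occupation(T, alpha, omega):
    """Return H_1 / H_0 for real omega, with sums running over m = 1..T."""
    logs = [lgamma(m - alpha) - lgamma(m + 1) - omega * m
            for m in range(1, T + 1)]
    shift = max(logs)                        # stabilizes the exponentials
    weights = [exp(v - shift) for v in logs]
    h0 = sum(weights)
    h1 = sum(m * w for m, w in enumerate(weights, start=1))
    return h1 / h0

def solve_saddle(N, T, alpha, lo=-5.0, hi=5.0, iterations=60):
    """Bisection for the real omega with mean_occupation = T / N."""
    target = T / N
    if not mean_occupation(T, alpha, hi) < target < mean_occupation(T, alpha, lo):
        raise ValueError("T/N lies outside the bracketing interval")
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if mean_occupation(T, alpha, mid) > target:
            lo = mid                         # ratio too large: raise omega
        else:
            hi = mid
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    T, N, alpha = 2500, 8, 0.1
    omega = solve_saddle(N, T, alpha)
    print("x =", omega * T)
```

The rescaled solution $x = \bar\omega T$ is the quantity that enters the scaling argument of the text.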



For $\alpha > 0$, the solution $x = x(N,T,\alpha)$ of the saddle-point equation (8) in the limit $T \to \infty$ depends on $N, T$ only through the realization-specific variable $s = N/T^{\alpha}$. In order to show this, we first observe that the scaling of $H_n(T,\alpha,x/T)$ with $T$ can be studied by approximating the sums in their definition with integrals. Secondly, we show that the functions $H_n(T,\alpha,x/T)$ can be expressed as

$$ H_n(T,\alpha,x/T) = \frac{1}{\Gamma(1-\alpha)}\, \frac{[T^{n-\alpha} - 1]}{(n-\alpha)} + \frac{T^{n-\alpha}}{\Gamma(1-\alpha)} \int_{1/T}^{1} dt\; t^{n-\alpha-1} \left( e^{-xt} - 1 \right), \tag{10} $$

where we have added and subtracted a term of the type $H_n(T,\alpha,0)$. Consequently, for large $T$, $H_1(T,\alpha,x/T) \to T^{1-\alpha} h_1(\alpha,x)$ while $H_0(T,\alpha,x/T) \to h_0(\alpha)$. Inserting these relations in the saddle-point equation (8), and taking $N = sT^{\alpha}$, we obtain

$$ \frac{1}{s} = \frac{h_1(\alpha,x)}{h_0(\alpha)}, \tag{11} $$

proving that $x = x(s,\alpha)$ in the large $T$ limit for $\alpha > 0$.
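The intermediate steps behind the integral representation of $H_n$ can be written out as follows. This is a sketch under the continuum approximation $\Gamma(m-\alpha)/\Gamma(m+1) \simeq m^{-1-\alpha}$, which is accurate only for large $m$, with the substitution $m = tT$.

```latex
\begin{align}
H_n(T,\alpha,x/T)
 &\simeq \frac{1}{\Gamma(1-\alpha)} \int_{1}^{T} dm\; m^{n-\alpha-1}\, e^{-xm/T} \\
 &= \frac{1}{\Gamma(1-\alpha)} \int_{1}^{T} dm\; m^{n-\alpha-1}
  + \frac{1}{\Gamma(1-\alpha)} \int_{1}^{T} dm\; m^{n-\alpha-1}
    \left( e^{-xm/T} - 1 \right) \\
 &= \frac{T^{n-\alpha} - 1}{\Gamma(1-\alpha)\,(n-\alpha)}
  + \frac{T^{n-\alpha}}{\Gamma(1-\alpha)} \int_{1/T}^{1} dt\; t^{n-\alpha-1}
    \left( e^{-xt} - 1 \right).
\end{align}
% n = 1: both terms scale as T^{1-\alpha}, and within this approximation
%   h_1(\alpha,x) = \frac{1}{\Gamma(1-\alpha)} \int_0^1 dt\; t^{-\alpha} e^{-xt}.
% n = 0: the second term is O(T^{-\alpha}) because the integrand behaves as
%   -x\, t^{-\alpha} near t = 0, which is integrable for \alpha < 1, so H_0
%   tends to a T-independent constant h_0(\alpha).
```

With these limits the saddle-point condition reads $T/N = T^{1-\alpha} h_1(\alpha,x)/h_0(\alpha)$, and setting $N = sT^{\alpha}$ removes all explicit dependence on $T$.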
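The final expression for $Z_{N,T}$ is therefore given by

$$ Z_{N,T} \simeq \frac{T!\, C_{N,T}}{N!}\; \frac{e^{N \log H_0(T,\alpha,x/T) + x}}{\sqrt{2\pi N J(T,\alpha,x/T)}}, \qquad x = x(s,\alpha), \tag{12} $$

where we evaluated the saddle point up to the second order, and the function $J(T,\alpha,x/T)$ is given by

$$ J(T,\alpha,x/T) = \left. \frac{\partial^2 \log H_0(T,\alpha,\omega)}{\partial \omega^2} \right|_{\omega = x/T}. \tag{13} $$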



A similar procedure applies to the evaluation of $Z_{N,T-k}$, with the saddle point $\bar\omega' = x'/(T-k)$ satisfying the equation

$$ \frac{T-k}{N} = \frac{H_1(T-k,\alpha,x'/(T-k))}{H_0(T-k,\alpha,x'/(T-k))}. \tag{14} $$

Following arguments similar to the ones provided for the scaling of $x$, we can show that the solution $x' = x'(k,N,T,\alpha)$ of Eq. (14) depends on $k, N, T$ only through the variables $s = N/T^{\alpha}$ and $k/T$, i.e. $x' = x'(s, k/T, \alpha)$. Following a reasoning similar to the one we adopted for proving that the function $Z_{N,T}$ in Eq. (12) depends exclusively on the parameters $s, \alpha$, it is possible to show that the function $Z_{N,T-k}$ only depends on $s, k/T, \alpha$, i.e. $Z_{N,T-k} = \mathcal{G}(s, k/T, \alpha)$.

Therefore, taking in Eq. (6) the large $k$, $T$ limit with $k/T = O(1)$, $F(k,N,T)$ for $\alpha > 0$ satisfies the scaling relation

$$ F(k,N,T) = \frac{Z_{N,T-k}}{Z_{N,T}}\, \frac{\Gamma(k-\alpha)}{\Gamma(k+1)\Gamma(1-\alpha)} \simeq \frac{1}{T^{\alpha+1}} \left( \frac{T}{k} \right)^{\alpha+1} q(k/T,\, s = N/T^{\alpha},\, \alpha), \tag{15} $$

where the function $q$ (containing $x'$, $x$, and the second-order corrections) represents the finite-size corrections to the asymptotic behavior. These findings shed light on the absence of self-averaging in the process. At any given time $T$ the process will have finite fluctuations, persistent also in the limit $T \to \infty$. These fluctuations depend on the non-stationarity of the process, and on the non-self-averaging value of the number of classes $N$. Therefore the process, if conditioned on the number of classes $N$, shows fluctuations that go to zero as $T \to \infty$. Figure 1 compares simulations with the analytical predictions of Eq. (6). The figure shows that the finite-size correction to the power-law tail $\sim 1/k^{1+\alpha}$, for some $s$ and large $k/T$, may increase its value, giving rise to an anomalous "bump" in the distribution. On the other hand, this local maximum never develops into a concentrated "condensate", and for $k/T \to 1$, for any $s$, the cutoff $q$ always dampens $F$.
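The persistence of the fluctuations of $s = N/T^{\alpha}$ can be illustrated with a small numerical experiment. Since the probability of opening a new table, $(\alpha N + \theta)/(T + \theta)$, depends on the configuration only through $N$ and $T$, the number of tables can be simulated on its own without tracking individual occupations. The function names, parameter values and number of runs below are arbitrary choices made for this sketch.

```python
import random
from statistics import mean, pstdev

def simulate_num_tables(T, alpha, theta, rng):
    """Number of tables after T customers, using only the new-table rule."""
    N = 0
    for t in range(T):               # t customers are already seated
        if rng.random() * (t + theta) < alpha * N + theta:
            N += 1
    return N

def relative_spread(T, alpha, theta, runs, seed=0):
    """Standard deviation over mean of s = N / T**alpha across realizations."""
    rng = random.Random(seed)
    s = [simulate_num_tables(T, alpha, theta, rng) / T ** alpha
         for _ in range(runs)]
    return pstdev(s) / mean(s)

if __name__ == "__main__":
    for T in (500, 4000):
        print(T, relative_spread(T, alpha=0.5, theta=1.0, runs=200))
```

For a self-averaging quantity the printed ratio would shrink as $T$ grows; here it is expected to stay of the same order at both sizes, so that each realization carries its own value of $s$.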

FIG. 1: (Color online) Rescaled distribution of the occupation numbers in the Chinese Restaurant Process with $\alpha = 0.1$ and $\theta = 1$ at different times $T$ and numbers of tables $N = sT^{\alpha}$. We report the log-binned distribution $F(k,N,T)$ for $s = N/T^{\alpha}$ with $s = 3.5$ (black symbols, dotted line), $s = 5$ (red symbols, solid line), $s = 7.5$ (blue symbols, dashed line), and $s = 10$ (brown symbols, dash-dotted line). The rescaled data are shown for processes with $T = 2500$ (triangles), $T = 5000$ (squares), and $T = 10^4$ (circles). The lines show the analytical solutions calculated by solving the saddle-point equations (8) and (14) for $T = 2500$.

In conclusion, we have presented a statistical mechanics study of the Chinese Restaurant Process, which generates power-law distributions with exponents $\gamma \in (1,2]$ by nonequilibrium growth and exhibits a condensation phenomenon, absence of self-averaging, and anomalous finite-size effects. We believe that this rich process will be of importance for future developments of the fields where these trends occur, i.e. biological evolution, complex systems, spin glasses and nonequilibrium phenomena.

G.B. acknowledges support from the IST STREP GENNETEC contract number 034952.